Trajectory-Level Data Augmentation for Offline Reinforcement Learning

arXiv.org Machine Learning

We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. In particular, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation technique that exploits task structure and the geometric relationship between rewards, value functions, and mathematical properties of logging policies. During data collection, our augmentation supports suboptimal logging policies, leading to higher data quality and improved offline reinforcement learning performance. We provide theoretical justification for these strategies and validate them empirically across positioning tasks of varying dimensionality and under partial observability.



A Implementation Details

Neural Information Processing Systems

For all the experimental results on ResNet-29 v2 (He et al., 2016b), we use a batch size The network is trained with Adam optimizer (Kingma et al., 2015) for 200 epochs. We randomly split the training dataset into training data of 45000 images and 5000 images as the validation set. We train a Wide ResNet-28-10 v2 (Zagoruyko & Komodakis, 2016) to obtain the state-of-the-art accuracy for CIFAR-10 (e.g., Table 2 in the main text). For mixup (Zhang et al., 2018; Thulasidasan et al., 2019), the mixing parameter of two images is For CCAT (Stutz et al., 2020), we observe that training models with adversarial examples bounded We train a Wide ResNet-28-10 v2 (Zagoruyko & Komodakis, 2016) to obtain the state-of-the-art accuracy for CIFAR-100. All the experiments on ImageNet were obtained via training a ResNet-101 v1 (He et al., 2016a) following the training script at The input image is normalized (divided by 255) to be within [0,1].
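The 45000/5000 split and the divide-by-255 normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_indices` and `normalize`, the seed, and the use of index lists instead of a real image loader are all assumptions.

```python
import random


def split_indices(num_examples=50_000, num_val=5_000, seed=0):
    """Randomly split example indices into (train, validation) lists.

    The sizes follow the text (45000 training, 5000 validation);
    the seed is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices[num_val:], indices[:num_val]


def normalize(pixels):
    """Map 8-bit pixel values in [0, 255] to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))
```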



New logarithmic step size for stochastic gradient descent

arXiv.org Artificial Intelligence

Stochastic gradient descent (SGD), which dates back to the work of Robbins and Monro [1951a], is widely used to train modern Deep Neural Networks (DNNs), which achieve state-of-the-art results in multiple problem domains such as image classification Krizhevsky et al. [2017, 2009], object detection Redmon and Farhadi [2017], and classification and automatic machine translation Zhang et al. [2015]. The value of the step size (or learning rate) is crucial for the convergence rate of SGD. Selecting an appropriate step size value in each iteration ensures that SGD iterations converge to an optimal solution. If the step size value is too large, it may prevent SGD iterations from reaching the optimal point. Conversely, excessively small step size values can lead to slow convergence or mistakenly identify a local minimum as the optimal solution Mishra and Sarawadekar [2019]. To address these challenges, various schemes have been proposed. One popular approach is the Armijo line search method, initially introduced for SGD by Vaswani et al. [2019], which provides theoretical results for strongly convex, convex, and non-convex objective functions.
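To make the role of the step size concrete, here is a minimal sketch of SGD with a backtracking Armijo line search on a toy least-squares problem. It illustrates the general Armijo condition only and is not the procedure of Vaswani et al. [2019] or the logarithmic step size named in the title. The names `armijo_step` and `sgd_armijo`, the constants `eta_max`, `c`, and `beta`, and the toy data are all assumptions chosen for illustration.

```python
import random


def armijo_step(loss, grad, w, eta_max=1.0, c=0.1, beta=0.5, max_backtracks=50):
    """Shrink eta from eta_max until the Armijo condition holds:
    loss(w - eta * g) <= loss(w) - c * eta * ||g||^2.
    """
    g = grad(w)
    g_sq = sum(gi * gi for gi in g)
    f0 = loss(w)
    eta = eta_max
    for _ in range(max_backtracks):
        w_new = [wi - eta * gi for wi, gi in zip(w, g)]
        if loss(w_new) <= f0 - c * eta * g_sq:
            return w_new, eta
        eta *= beta  # step too large: shrink and retry
    return w, 0.0  # no acceptable step found; leave w unchanged


def sgd_armijo(data, w0, steps=200, seed=0):
    """SGD where each step size is chosen by a line search on the sampled loss."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(steps):
        x, y = rng.choice(data)  # sample one example

        def loss(v):
            return 0.5 * (sum(vi * xi for vi, xi in zip(v, x)) - y) ** 2

        def grad(v):
            residual = sum(vi * xi for vi, xi in zip(v, x)) - y
            return [residual * xi for xi in x]

        w, _ = armijo_step(loss, grad, w)
    return w


# Toy data generated by y = 2*x1 - 1*x2 (hypothetical, noise-free).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, -1.0], 5.0)]
w = sgd_armijo(data, [0.0, 0.0])
print(w)
```

The line search replaces a hand-tuned learning rate: a step that is too large fails the sufficient-decrease test and is halved, so each accepted step reduces the sampled loss.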


Permutation Compressors for Provably Faster Distributed Nonconvex Optimization

arXiv.org Machine Learning

We study the MARINA method of Gorbunov et al. (2021), the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. Theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and the reliance on independent stochastic communication compression operators, which leads to a reduction in the number of transmitted bits within each communication round. In this paper we i) extend the theory of MARINA to support a much wider class of potentially correlated compressors, extending the reach of the method beyond the classical independent compressors setting, ii) show that a new quantity, for which we coin the name Hessian variance, allows us to significantly refine the original analysis of MARINA without any additional assumptions, and iii) identify a special class of correlated compressors based on the idea of random permutations, for which we coin the term Perm$K$, the use of which leads to $O(\sqrt{n})$ (resp. $O(1 + d/\sqrt{n})$) improvement in the theoretical communication complexity of MARINA in the low Hessian variance regime when $d\geq n$ (resp. $d \leq n$), where $n$ is the number of workers and $d$ is the number of parameters describing the model we are learning. We corroborate our theoretical results with carefully engineered synthetic experiments with minimizing the average of nonconvex quadratics, and on autoencoder training with the MNIST dataset.
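The idea of correlated compressors built from a random permutation can be sketched as follows for the case $d \geq n$. This is a simplified reading of the abstract, not the paper's implementation: it assumes $d$ is divisible by $n$, that all workers share one permutation per round, and that each worker sends only its own block of coordinates scaled by $n$. The names `perm_k` and `average` are hypothetical.

```python
import random


def perm_k(vectors, rng):
    """Compress one vector per worker using a shared random permutation.

    Worker i keeps only the i-th block of permuted coordinates, scaled by n,
    so each worker transmits d/n entries instead of d.
    Assumes d is divisible by n (an illustrative simplification).
    """
    n = len(vectors)
    d = len(vectors[0])
    assert d % n == 0, "this sketch assumes d is a multiple of n"
    block = d // n
    perm = list(range(d))
    rng.shuffle(perm)  # one permutation shared by all workers
    compressed = []
    for i, x in enumerate(vectors):
        out = [0.0] * d
        for j in perm[i * block:(i + 1) * block]:
            out[j] = n * x[j]
        compressed.append(out)
    return compressed


def average(vectors):
    """Coordinate-wise mean, as a server would compute it."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
compressed = perm_k([x, x], rng)  # two workers holding the same vector
print(average(compressed))
```

Because the blocks are disjoint and together cover every coordinate, averaging the compressed vectors recovers the input exactly when all workers hold the same vector, which independent sparsifiers do not guarantee.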


Scalable MCMC for Mixed Membership Stochastic Blockmodels

arXiv.org Machine Learning

We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in mixed-membership stochastic blockmodels (MMSB). Our algorithm is based on the stochastic gradient Riemannian Langevin sampler and achieves both faster speed and higher accuracy at every iteration than the current state-of-the-art algorithm based on stochastic variational inference. In addition, we develop an approximation that can handle models that entertain a very large number of communities. The experimental results show that SG-MCMC strictly dominates competing algorithms in all cases.